library(tidyverse)The pipes of the tidyverse are all fun and games for
vectors, but they unlock their true potential when applied to data
frames. Furthermore, the tidyverse offers a lot of
functions for “data wrangling”, i.e. cleaning, formatting and processing
data. Let’s load in the nerd data again.
nerd <- read.csv("exercises/data_sets/nerd_data_short.csv", sep = "\t")
# Should the paths above return an error for you, try going to
# Tools -> Global Options -> R Markdown
# and set "evaluate chunks in directory:" to "Project"Since the data contains a lot of information we are not interested in
right now, we would like to select only the columns we want to work with
right now? This is what the select() function does. I want
to keep the participants’ age, the gender, whether they’re married or
not, and all the questions from the nerd questionnaire,
e.g. Q1 - Q26.
nerd %>%
select(age, gender, married, Q1:Q26)#29 columns; no $ in tidyverse#
A few remarkable things just happened. First of all, look at how I’m
accessing the columns: I’m just using their names, without writing the
name of the data frame again. That is, I’m writing age
instead of nerd$age. That is not self-evident at all and a
feature of the tidyverse’s “tidy evaluation”, called “data
masking”. The data source (nerd) is “masked” and only the
column names are visible. The tidyerse “knows” what you are
referring to because you piped the main data source into the
select() function. That makes the code more readable.
And then there’s the fact that we don’t have to type out all of the
columns from Q1 to Q26. When we write
Q1:Q26, this lets the tidyverse know that we
want to select all columns from Q1 to Q26.
Also note how the columns are now displayed in the order we typed
them. That means that select can also be used to re-order your data.
select() also comes with a lot of “helper functions” that
make it even more useful. For example, imagine you want to bring one of
the columns to the front of the data frame, and the rest of the data
frame should come afterwards. You don’t feel like typing all the other
column names, and luckily, there’s a shortcut for that:
everything(). It does exactly what the name suggests: It
selects everything in your data frame (except for the
things you already selected). Say that, for some reason, we wanted the
column voted (whether or not someone voted in the past
election) to go first, then the age, and then all the rest. The
corresponding code looks like this:
nerd %>%
select(voted, age, everything())#reorganize the columns, everything exepted previous mentioned things; everything is a function!, sonst everything as column name#
Don’t forget that everything needs to be used with
parentheses!
2.1) I want to continue working with a reduced version of our data
that only contains the columns age, gender,
married, voted, nerdy and also
Q1 - Q26. Select these columns and save this
reduced data frame in a variable called nerd_red (“nerd
reduced”).
nerd_red <-
nerd %>%
select (age, gender, married, voted, nerdy, Q1:Q26)select() gets you a subset of the columns in your data
frame, but often, you want to filter out specific rows in your data that
fulfill certain criteria. This can be achieved with the
filter() function. Let’s for example only filter out the
female participants, which are coded as 2 in the
gender column.
nerd_red %>%
filter(gender == 2)With filter(), you can use all of the logical
comparisons we’ve seen before. Here, we filter for women below the age
of 30.
nerd_red %>%
filter(gender == 2 & age < 30)#in filter() you can replace “&” with “,”
Interestingly, the logical & and a simple comma are
equivalent in filter().
nerd_red %>%
filter(gender == 2, age < 30)Practice filtering on the nerd_red data. Filter …
2.2) … anyone who is either 22 or 26 years old.
nerd_red %>%
filter(age == 22 | age == 26)2.3) … anyone who is either a man (gender == 1)
or older than 30.
nerd_red %>%
filter(gender == 1 | age > 30)2.4) … anyone who is non-binary (gender == 3) and who is
currently married or has previously been (1 = Never married, 2 =
Currently married, 3 = Previously married).
nerd_red %>%
filter(gender == 3 & (married == 2 | married == 3))nerd_red %>%
filter(gender == 3 & married != 1)2.5) The column nerdy contains the agreement to the
statement “I see myself as nerdy” (1 = disagree strongly - 7 = agree
strongly). How many women have the highest self-reported nerdiness? How
many have the lowest self-reported nerdiness? What about the men? Solve
this task using filters.
nerd_red %>%
filter(gender == 2, nerdy == 7)nerd_red %>%
filter(gender == 2, nerdy == 7) %>%
nrow()## [1] 135
nerd_red %>%
filter(gender == 2, nerdy == 1) %>%
nrow()## [1] 12
nerd_red %>%
filter(gender == 1, nerdy == 7) %>%
nrow()## [1] 87
nerd_red %>%
filter(gender == 1, nerdy == 1) %>%
nrow()## [1] 8
Of course, if we want to know how many people have the highest
nerdiness, there are much more elegant ways to achieve this. Here is how
we can use the count() function to see how many rows (here:
participants) we have for each nerdiness response
nerd_red %>%
count(nerdy)3.1) How many people of each gender are included in the data? Use the codebook to find out what the values represent.
nerd_red %>%
count(gender)3.2) How many people per marital status do we have? Use the codebook for help.
nerd_red %>%
count(married)3.3) How many people voted in the last election, how many people didn’t? Use the codebook for help.
nerd_red %>%
count(voted)Seems like there are few implausible values in our data - and these are not the only columns that are affected.
4.1) What is wrong with the age column?
summary(nerd_red$age)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 17.00 20.00 63.29 27.00 38822.00
We should filter out all participants whose data are implausible.1
If you have time and want to get a little bit of practice, you can examine each column in the data and take a look at the codebook to find out which values are valid. For the purpose of the workshop, I’ll give you a summary of what’s wrong with the data. If you don’t want any spoilers, stop reading here.
Here are the issues we can identify in our nerd
data:
age.nerdy column, but
values shouldn’t go lower than 1.married valid values are 1 - 3 should be possible
(1 = Never married, 2 = Currently married, 3 = Previously married), but
there are some 0s in the data.voted valid values are 1 or 2 (“Have you voted in a
national election in the past year?”, 1 = Yes, 2 = No), but there are
some 0s in the data.4.2) Filter the nerd_red so it only contains the data
with valid values. (For age, a range from 18 - 100 might be good.)
Overwrite nerd_red with the filtered version.
#check solutions#
4.3) Confirm that the filtering worked by running checks on
nerdy, married, voted and
age.
#check solutions#
We can add or modify columns in our data using the
mutate() function. Since columns are just vectors, we can
work with them in the same way. For example, making all of our
participants 10 years older would work like this:
nerd_red %>%
mutate(ten_years_older = age + 10)Within mutate(), I specify the new column name
ten_years_older, and after an =, I give
instructions about how the content of the new column should be
calculated (age + 10). If I use age as the
column name, the age column will be overwritten:
nerd_red %>%
mutate(age = age + 10)#different from base R, not overwritten# #== comparison#
nerd_red %>%
mutate(age = age + 10)Note, however, that the column is only overwritten in the output -
nerd_red is still intact. If I want to add a column
permanently (!!) to the data frame, I need to overwrite the data frame.
(This is different in base R - when I a column in base
R, that changes the data frame immediately.)
We can also modify columns using functions. For example, remember
that voted is currently coded as 1 (“yes”) and
2 (“no”). We had to consult the codebook to see what the
numbers stand for, so we could make things clearer and convert
voted into a column that contains the values “yes” and “no”
as characters.
For this task, the function ifelse() can help us. It
takes three arguments:
For example, if we want to classify these numbers into “one digit” or “more than one digit”, it looks like this:
some_numbers <- c(2, 24, 12, 5, 1, 353, 1)
ifelse(
some_numbers < 10,
"one",
"more than one"
)## [1] "one" "more than one" "more than one" "one"
## [5] "one" "more than one" "one"
some_numbers < 10## [1] TRUE FALSE FALSE TRUE TRUE FALSE TRUE
some_numbers <- c(2, 24, 12, 5, 1, 353, 1)
ifelse(
some_numbers < 10,
"TRUE",
"FALSE"
)## [1] "TRUE" "FALSE" "FALSE" "TRUE" "TRUE" "FALSE" "TRUE"
5.1) Recode voted. Turn 1 into “yes” and 2 into “no”.
Use ifelse() inside of mutate() to overwrite
the voted column so it is permanently
added to nerd_red.
nerd_red <-
nerd_red %>%
mutate(
voted = ifelse(voted == 1, "yes", "no")
)Note that ifelse() only works when we have two
conditions: One where the logical comparison is TRUE and
one where it’s FALSE. For more complicated cases,
ifelse() can’t help us. This is why it was so important
that we got rid of the cases where voted was 0. See what
would have happened if we would have used ifelse() on
voted when it still contained 0s:
voted_example <- c(1, 2, 2, 1, 2, 0, 0, 1)
ifelse(
voted_example == 1,
"yes",
"no"
)## [1] "yes" "no" "no" "yes" "no" "no" "no" "yes"
What if we have more than two cases, as for example in the
gender column (1 = “male”, 2 = “female”, 3 = “other”)? In
this scenario, we use case_when(). It takes several logical
comparisons as argument - one for every case that we want to cover in
our data. For each of these cases, we tell case_when() what
to do using the ~ sign. Take a look at this example:
gender_example <- c(1, 2, 3, 3, 1, 1, 2, 2, 2)
case_when(
gender_example == 1 ~ "male",
gender_example == 2 ~ "female",
gender_example == 3 ~ "other"
)## [1] "male" "female" "other" "other" "male" "male" "female" "female"
## [9] "female"
Whenever the gender_example is 1, we insert “male”, when
it is 2, we put in “female”, and when it is 3, we put in “other”. Note
that it is important to specify all possible cases. If we only had put
in values for the cases 1 and 2 (so, male and female), every occurrence
of 3 would have been left blank (NA).
case_when(
gender_example == 1 ~ "male",
gender_example == 2 ~ "female",
)## [1] "male" "female" NA NA "male" "male" "female" "female"
## [9] "female"
case_when(
gender_example == 1 ~ "male",
gender_example == 2 ~ "female",
TRUE ~ "something"
)## [1] "male" "female" "something" "something" "male" "male"
## [7] "female" "female" "female"
Here is how we can use case_when() to recode the
gender column in our data. Note that we overwrite the
gender column in nerd_red - to achieve this,
we set gender as the new column name (which will overwrite
the old gender column), and also overwrite
nerd_red with the new, modified version of
nerd_red.
nerd_red <-
nerd_red %>%
mutate(
gender = case_when(
gender == 1 ~ "male",
gender == 2 ~ "female",
gender == 3 ~ "other"
)
)
nerd_red In a lot of cases, we want to calculate the sum of columns, e.g. to get a sum score for one of our questionnaires. Adding two columns together is pretty straightforward - you probably already know how to do this.
5.2) Create a column called sum_Q1_Q2 that is the sum of
Q1 and Q2. (Don’t overwrite
nerd_red.)
nerd_red %>%
mutate(sum_Q1_Q2 = Q1 + Q2)Unfortunately, calculating the sum score across the entire
questionnaire from Q1 - Q26 is a little bit
more complicated, at least if we don’t want to type all
of the column names from Q1 - Q26.
nerd_red %>%
rowwise() %>%
mutate(
nerd_score = sum(across(Q1:Q26))
) %>%
ungroup()#across() always when multiplying columns# #ungroup: if not, it’s permanent and it will influence other functions#
First, let’s take a look at what’s going on inside of the
mutate() function. We created a new column
nerd_score - that part is familiar. Next, we use the
sum() function, because we want to calculate a sum - that
makes sense. The part Q1:Q26 is also something you’ve seen
before - we used it to select the columns Q1 -
Q26. What is new is across(). This is another
“helper function” that is only used inside of mutate(). We
can use it to let mutate() work on several columns at once.
If you read the code from left to right, it actually is a nice
representation of our everyday language: We calculate the
sum() across() columns
Q1:Q26.
Another crucial piece of code that is new to us is the
rowwise() function before using mutate(). It
does exactly what the name suggests: It causes all following functions
to be executed separately on every single row of a data frame. That is,
we calculate the sum for every row (= participant). This might seem
slightly confusing because usually, calculations are performed rowwise
anyways (you just saw that when we simply added Q1 and
Q2). sum(), however, is an exception - we have
to force it to work on each row individually. As a last step, we
ungroup() the data. rowwise() permanently
groups the data (each row is a group), and that might have unintended
consequences for other functions later.
A useful function for filtering is %in% (at first
glance, it’s easily confused with the pipe). %in% tells you
for each element on the left whether it also exists on the right. Check
out this code to see how it works:
c("tea", "biscuits", "cake") %in% "tea"## [1] TRUE FALSE FALSE
"tea" %in% c("tea", "biscuits", "cake")## [1] TRUE
Looking back at the exercise where we wanted to filter for all
participants who were either 22 or 26 years old, we can use
%in% like this:
nerd_red %>%
filter(age %in% c(22, 26))Another thing that is very useful is arrange() which
lets you sort your data easily. For example, here is how you would sort
the data by age:
nerd_red %>%
arrange(age)We can also arrange the data in descending order by wrapping the
respective column in desc() (for “descending”).
nerd_red %>%
arrange(desc(age))Sorting by multiple columns also works (try switching the order and examine the result):
nerd_red %>%
arrange(age, gender)We know how to extract a vector from a data frame.
age <- nerd$ageBut how can we extract a single column at the end of a chain of
pipes? This is what the pull() function is for.
nerd_red %>%
filter(age < 20) %>%
pull(age)## [1] 17 18 18 17 15 18 16 15 17 14 17 18 15 16 14 17 15 17 18 19 13 16 17 15 19
## [26] 18 18 14 16 14 17 14 13 16 15 14 18 14 14 15 17 17 19 14 18 19 17 16 15 16
## [51] 15 18 16 16 14 18 17 13 17 14 14 15 17 18 17 14 16 19 19 18 17 15 16 16 18
## [76] 15 18 17 19 17 16 17 18 18 18 16 17 14 19 14 18 16 17 18 15 18 13 17 19 16
## [101] 17 17 17 13 16 18 18 13 18 19 17 13 16 14 19 13 15 16 14 18 13 19 15 15 16
## [126] 13 13 18 19 15 17 17 18 18 17 19 16 18 16 17 19 19 18 18 18 16 13 15 16 18
## [151] 18 18 15 19 19 15 15 16 17 19 18 14 14 19 17 14 13 17 19 17 19 19 14 18 16
## [176] 19 16 17 19 15 14 18 16 16 17 14 16 19 18 14 18 15 14 15 16 16 17 15 17 17
## [201] 14 16 15 16 13 15 15 18 14 19 17 18 18 17 18 18 16 17 16 15 18 19 17 16 18
## [226] 19 18 17 17 16 19 19 18 17 17 19 17 16 15 13 18 17 18 18 18 18 18 16 17 19
## [251] 13 17 18 17 18 15 19 16 18 16 16 16 14 15 18 19 18 19 19 19 19 13 16 17 17
## [276] 17 18 13 17 15 16 18 17 16 18 16 15 18 16 14 18 16 13 15 16 16 16 16 16 16
## [301] 16 18 19 16 19 13 14 15 15 14 15 18 18 15 16 17 19 15 16 18 13 17 18 15 16
## [326] 17 17 17 14 16 17 19 16 17 15 16 15 19 18 18 18 17 16 15 18 15 18 15 16 16
## [351] 16 15 15 13 15 16 15 17 19 18 16 14 19 18 17 18 18 18 15 18 17 19 19 18 19
## [376] 19 17 19 16 19 17 19 16 14 16 19 16 17 18 18 14 18 18 13 16 14 17 17 16 17
## [401] 18 13 13 16 18 17 18 19 15 15 18 18 18 17 17 16 13 18 19 16 18 18 17 18 17
## [426] 17 19 18 16 16 15 17 18 16 15 14 18 17 17 15 15 17 16 18 19 16 16 19 14 13
## [451] 17 16 14 18 13 15 17
You might wonder why we don’t just use select() to do
the job. Here, we try to use select for the same purpose. Can you see
what the problem is?
nerd_red %>%
filter(age < 20) %>%
select(age)When using pull(), what we get back is a vector. When
using select, what we get back is a data frame with one column.
I can control where the new column is added using the
.before (or .after) argument.2.
nerd_red %>%
mutate(
ten_years_older = age + 10,
.after = age
)nerd_red %>%
mutate(
ten_years_older = age + 10,
.before = gender
)THIS IS A GITHUB PUSH CHECKPOINT
Of course, it is bad to come up with exclusion criteria after looking at the data and ideally, the online form would have prevented people from entering implausible age values or excluded them right away. But we’re just practicing to code, so we can take a break from caring about data quality.↩︎
The reason is that inside the mutate()
function, anything that looks like something = whatver
would create a column of the name “something”. That means if we just
wrote before = gender, mutate() would create a
column called “before”, that is a duplicate of “gender”.↩︎